Document chunking in `summarizer.py` was still failing with some models on some texts: certain model vocabularies caused the model server to fail to return detokenized data when specific token sequences were split into separate chunks by the chunking process.

To address this, we've implemented a new approach to chunking. Instead of chunking the raw tokens, we chunk the text, as in our earlier implementation, and rely on `extras/tokenize/count` to confirm that each text chunk is within the token limit; if a chunk exceeds the limit, we split it in two.

I've manually evaluated this with both the problematic and non-problematic models. It works as well as before and avoids the detokenization crashes we were experiencing.
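For reference, here is a minimal sketch of the recursive text-based chunking described above. The endpoint URL, the `count_tokens` helper, and the response shape are assumptions for illustration, not the actual implementation in `summarizer.py`:

```python
# Sketch of text-based chunking with server-side token counting.
# The endpoint URL and JSON shapes are hypothetical.
from typing import List

import requests

TOKENIZE_COUNT_URL = "http://localhost:8000/extras/tokenize/count"  # assumed endpoint


def count_tokens(text: str, model: str) -> int:
    """Ask the model server how many tokens this text occupies."""
    resp = requests.post(TOKENIZE_COUNT_URL, json={"model": model, "text": text})
    resp.raise_for_status()
    return resp.json()["count"]  # assumed response field


def chunk_text(text: str, model: str, max_tokens: int) -> List[str]:
    """Split text into chunks that each fit within max_tokens.

    Chunks are cut from the raw text rather than from token sequences,
    so no token sequence is ever split across chunk boundaries. Any
    chunk over the limit is bisected and both halves are re-checked.
    """
    if count_tokens(text, model) <= max_tokens:
        return [text]
    mid = len(text) // 2
    return chunk_text(text[:mid], model, max_tokens) + chunk_text(
        text[mid:], model, max_tokens
    )
```

Because splitting happens on the text itself, the detokenization step on the server only ever sees chunks it tokenized as a whole, which is what sidesteps the vocabulary-specific crashes.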